Project 1 - Inverted Indexer

Assigned: 1/22/08   Due: 1/29/08 at 4:30 PM

Review:

An inverted index is a mapping of words to their location in a set of documents. Most modern search engines utilize some form of an inverted index to process user-submitted queries. In its most basic form, an inverted index is a simple hash table which maps words in the documents to some sort of document identifier. For example, if given the following 2 documents:

We could construct the following inverted file index:

Data Set:

As your final data set, you will use a crawl of wikipedia articles. The data set is located on the DFS at /shared/wikipedia (roughly 12 GB of text in 193 files each sized roughly 64MB). Each wikipedia article is on its own line. A typical line of the data set (article) will start as follows:

Notice that the title of the article is delimited by an XML title tag, and the text of the article is delimited by an XML text tag. The links in the article to other wikipedia articles are delimited by "[[" and "]]". Also, certain links will look like this: "[[Longer Real Name|Short Name]]". These links appear in the article with "Short Name" but link to an article titled "Longer Real Name".

Part I:

Some words are so common that their presence in an inverted index is "noise" -- they can obfuscate the more interesting properties of a document. In addition to this, the data set is full of markup, punctuation, and variations in capitalization. For the first part of this project, run a word count over an interesting corpus (e.g. Shakespeare or wikipedia) to identify any noise words. Your goal is to remove these words from your final output as well as "scrub" punctuation and XML markup and normalize the words. One strategy to accomplish this might be multiple map-reduce passes, but you are free to do this however you want. Lastly, include any assumptions you made about punctuation removal/normalization of words in your write-up, extra karma for creating language-independent scrubbing algorithm!

Part II:

For this portion of the assignment, you will design a MapReduce-based algorithm to calculate the inverted index over the data. Your final inverted index should not contain the words identified in part I of this project. Also, the format of your MapReduce output (i.e. the inverted index) must be simple enough to be machine-parseable; it is not impossible to imagine your index being one of many data structures used in a search engine's indexing pipeline. Lastly, your submitted indexer should be able to run successfully on the wikipedia corpus. "Successfully" means it should run to completion without errors or exceptions, and generate the correct word->title mapping.

Part III:

In addition to the requirements detailed above, implement at least one extension of your choice. The extensions may be as simple or as complex as you'd like; the primarily goal is to give you a chance to play with the MapReduce system. We have provided some sample extensions that you are free to choose from:

Write-up:

  1. How long do you think this project would take to finish?
  2. How much time did you actually spend on this project?
  3. What noisy words did you find as a result of your WordCount? By what criteria did you decide where to draw the line? Were you surprised by any of the words?
  4. Which extensions did you do? Please detail any interesting results you found.
  5. Acknowledge any outside assistance you received from anyone except assigned course readings and the course staff.

Hints:

Turn-In instructions:

Please create an archive (preferably zip) that contains:

Do not include any of your output, gigabytes of text, grandma's meat loaf recipe, etc.
Turn the archive in to the catalyst drop box here before 4:30 on 1/29/08